Reptile: representative tiling for short read error correction
Authors
Abstract
MOTIVATION: Error correction is critical to the success of next-generation sequencing applications, such as resequencing and de novo genome sequencing. It is especially important for high-throughput short-read sequencing, where reads are much shorter and more abundant, and errors more frequent, than in traditional Sanger sequencing. Processing massive numbers of short reads with existing error correction methods is both compute- and memory-intensive, yet the results are far from satisfactory when applied to real datasets.
RESULTS: We present a novel approach, termed Reptile, for error correction in short-read data from next-generation sequencing. Reptile works with the spectrum of k-mers from the input reads, and corrects errors by simultaneously examining: (i) Hamming distance-based correction possibilities for potentially erroneous k-mers; and (ii) neighboring k-mers from the same read for correct contextual information. Because it does not need to store the input data, Reptile can handle data that does not fit in main memory. In addition to sequence data, Reptile can make use of available quality score information. Our experiments show that Reptile outperforms previous methods in the percentage of errors removed from the data and the accuracy of true base assignment. In addition, a significant reduction in run time and memory usage has been achieved compared with previous methods, making it more practical for short-read error correction when sampling larger genomes.
AVAILABILITY: Reptile is implemented in C++ and is available through the link: http://aluru-sun.ece.iastate.edu/doku.php?id=software
CONTACT: [email protected].
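The two ingredients named in the abstract, a k-mer spectrum and Hamming distance-based correction candidates, can be illustrated with a small Python sketch. This is not Reptile's implementation (which is in C++): the function names, the `min_count` threshold, and the restriction to Hamming distance 1 are assumptions made for illustration, and the sketch leaves out the contextual check against neighboring k-mers as well as quality scores.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer occurring in the reads (the k-mer spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming1_neighbors(kmer, alphabet="ACGT"):
    """Yield every string at Hamming distance exactly 1 from kmer."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def correction_candidates(kmer, spectrum, min_count=2):
    """Return frequent k-mers one substitution away from a rare k-mer.

    min_count is a hypothetical threshold: k-mers seen at least this
    often are treated as trustworthy and are not corrected.
    """
    if spectrum[kmer] >= min_count:
        return []
    candidates = [n for n in hamming1_neighbors(kmer) if spectrum[n] >= min_count]
    # Most frequent candidate first.
    return sorted(candidates, key=lambda n: -spectrum[n])


if __name__ == "__main__":
    # Three identical reads plus one read with a single substitution (T -> A).
    reads = ["ACGTAC"] * 3 + ["ACGAAC"]
    spectrum = kmer_spectrum(reads, k=4)
    print(correction_candidates("ACGA", spectrum))
```

In this toy input the rare k-mer `ACGA` has exactly one frequent neighbor, `ACGT`. On real data a rare k-mer can have several frequent neighbors, which is the ambiguity the abstract says Reptile resolves by also examining neighboring k-mers from the same read.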
Similar resources
SHREC: a short-read error correction method
MOTIVATION Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for the design of fast and acc...
PREMIER - PRobabilistic error-correction using Markov inference in errored reads
THIS PAPER IS ELIGIBLE FOR THE STUDENT PAPER AWARD. In this work we present a flexible, probabilistic and reference-free method of error correction for high throughput DNA sequencing data. The key is to exploit the high coverage of sequencing data and model short sequence outputs as independent realizations of a Hidden Markov Model (HMM). We pose the problem of error correction of reads as one ...
The Relationship between Education and Health: Vector Error Correction Model (VECM)
Background & objectives: Despite the importance and impact of health and education on economic growth in countries, the causal relationship between education and health is important to policymaking. This study aimed to investigate the causality relationship between education and health in the short and long runs using the Vector Error Correction Model (VECM) in Iran. Method: This was an analyt...
Monetary policy and exchange rate overshooting in Iran: A Vector Errors Correction (VEC) approach
Assumption of exchange rate overshooting has significant position in international macroeconomic discussion. This phenomenon is one of the abnormal behaviors of exchange rate that happen in short run. Dornbusch (1976) shows that because speed of equilibrium prices is slow relative to asset markets and commodity prices are sticky in the short run, However, over time, commodity prices will rise a...
Improved long read correction for de novo assembly using an FM-index
Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limits their application to complex genomes. One solution is to decrease the cost and t...
Journal title:
- Bioinformatics
Volume 26, Issue 20
Pages -
Publication date 2010